Jonathan L. Elsas. An Evaluation of Projection Techniques for Document Clustering: Latent Semantic Analysis and Independent Component
نویسندگان
چکیده
Dimensionality reduction in the bag-of-words vector space document representation model has been widely studied for the purposes of improving accuracy and reducing computational load of document retrieval tasks. These techniques, however, have not been studied to the same degree with regard to document clustering tasks. This study evaluates the effectiveness of two popular dimensionality reduction techniques for clustering, and their effect on discovering accurate and understandable topical groupings of documents. The two techniques studied are Latent Semantic Analysis and Independent Component Analysis, each of which have been shown to be effective in the past for retrieval purposes.
منابع مشابه
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researches to use...
متن کاملComparing Dimension Reduction Techniques for Document Clustering
In this research, a systematic study is conducted of four dimension reduction techniques for the text clustering problem, using five benchmark data sets. Of the four methods -Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), Document Frequency (DF) and Random Projection (RP) -ICA and LSI are clearly superior when the k-means clustering algorithm is applied, irrespective of t...
متن کاملHierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics
This paper discusses about the future of the World Wide Web development, called Semantic Web. Undoubtedly, Web service is one of the most important services on the Internet, which has had the greatest impact on the generalization of the Internet in human societies. Internet penetration has been an effective factor in growth of the volume of information on the Web. The massive growth of informat...
متن کاملComparing and Combining Dimension Reduction Techniques for Efficient Text Clustering
A great challenge of text mining arises from the increasingly large text datasets and the high dimensionality associated with natural language. In this research, a systematic study is conducted of six Dimension Reduction Techniques (DRT) in the context of the text clustering problem using three standard benchmark datasets. The methods considered include three feature transformation techiques, I...
متن کاملDimensionality Reduction Techniques for Document Clustering- A Survey
Dimensionality reduction technique is applied to get rid of the inessential terms like redundant and noisy terms in documents. In this paper a systematic study is conducted for seven dimensionality reduction methods such as Latent Semantic Indexing (LSI), Random Projection (RP), Principle Component Analysis (PCA) and CUR decomposition, Latent Dirichlet Allocation(LDA), Singular value decomposit...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2005